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Abstract 


Dimensionality reduction is an important pre-processing step in many applications. Linear discrim- 
inant analysis (LDA) is a classical statistical approach for supervised dimensionality reduction. It 
aims to maximize the ratio of the between-class distance to the within-class distance, thus maximiz- 
ing the class discrimination. It has been used widely in many applications. However, the classical 
LDA formulation requires the nonsingularity of the scatter matrices involved. For undersampled 
problems, where the data dimensionality is much larger than the sample size, all scatter matrices 
are singular and classical LDA fails. Many extensions, including null space LDA (NLDA) and 
orthogonal LDA (OLDA), have been proposed in the past to overcome this problem. NLDA aims 
to maximize the between-class distance in the null space of the within-class scatter matrix, while 
OLDA computes a set of orthogonal discriminant vectors via the simultaneous diagonalization of 
the scatter matrices. They have been applied successfully in various applications. 

In this paper, we present a computational and theoretical analysis of NLDA and OLDA. Our 
main result shows that under a mild condition which holds in many applications involving high- 
dimensional data, NLDA is equivalent to OLDA. We have performed extensive experiments on 
various types of data and results are consistent with our theoretical analysis. We further apply the 
regularization to OLDA. The algorithm is called regularized OLDA (or ROLDA for short). An effi- 
cient algorithm is presented to estimate the regularization value in ROLDA. A comparative study on 
classification shows that ROLDA is very competitive with OLDA. This confirms the effectiveness 
of the regularization in ROLDA. 

Keywords: linear discriminant analysis, dimensionality reduction, null space, orthogonal matrix, 
regularization 


1. Introduction 


Dimensionality reduction is important in many applications of data mining, machine learning, and 
bioinformatics, due to the so-called curse of dimensionality (Bellmanna, 1961; Duda et al., 2000; 
Fukunaga, 1990; Hastie et al., 2001). Many methods have been proposed for dimensionality reduc- 
tion, such as principal component analysis (PCA) (Jolliffe, 1986) and linear discriminant analysis 
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(LDA) (Fukunaga, 1990). LDA aims to find the optimal discriminant vectors (transformation) by 
maximizing the ratio of the between-class distance to the within-class distance, thus achieving the 
maximum class discrimination. It has been applied successfully in many applications including 
information retrieval (Berry et al., 1995; Deerwester et al., 1990), face recognition (Belhumeour 
et al., 1997; Swets and Weng, 1996; Turk and Pentland, 1991), and microarray gene expression data 
analysis (Dudoit et al., 2002). However, classical LDA requires the so-called total scatter matrix to 
be nonsingular. In many applications such as those mentioned above, all scatter matrices in ques- 
tion can be singular since the data points are from a very high-dimensional space and in general the 
sample size does not exceed this dimensionality. This is known as the singularity or undersampled 
problem (Krzanowski et al., 1995). 

In recent years, many approaches have been proposed to deal with such high-dimensional, un- 
dersampled problem, including null space LDA (NLDA) (Chen et al., 2000; Huang et al., 2002), 
orthogonal LDA (OLDA) (Ye, 2005), uncorrelated LDA (ULDA) (Ye et al., 2004a; Ye, 2005), sub- 
space LDA (Belhumeour et al., 1997; Swets and Weng, 1996), regularized LDA (Friedman, 1989), 
and pseudo-inverse LDA (Raudys and Duin, 1998; Skurichina and Duin, 1996). Null space LDA 
computes the discriminant vectors in the null space of the within-class scatter matrix. Uncorrelated 
LDA and orthogonal LDA are among a family of algorithms for generalized discriminant analysis 
proposed in (Ye, 2005). The features in ULDA are uncorrelated, while the discriminant vectors in 
OLDA are orthogonal to each other. Subspace LDA (or PCA+LDA) applies an intermediate di- 
mensionality reduction stage such as PCA to reduce the dimensionality of the original data before 
classical LDA is applied. Regularized LDA uses a scaled multiple of the identity matrix to make 
the scatter matrix nonsingular. Pseudo-inverse LDA employs the pseudo-inverse to overcome the 
singularity problem. More details on these methods, as well as their relationship, can be found in 
(Ye, 2005). In this paper, we present a detailed computational and theoretical analysis of null space 
LDA and orthogonal LDA. 

In (Chen et al., 2000), the null space LDA (NLDA) was proposed, where the between-class 
distance is maximized in the null space of the within-class scatter matrix. The singularity problem 
is thus implicitly avoided. Similar idea has been mentioned briefly in (Belhumeour et al., 1997). 
(Huang et al., 2002) improved the efficiency of the algorithm by first removing the null space of 
the total scatter matrix, based on the observation that the null space of the total scatter matrix is the 
intersection of the null space of the between-class scatter matrix and the null space of the within- 
class scatter matrix. 

In orthogonal LDA (OLDA), a set of orthogonal discriminant vectors is computed, based on 
a generalized optimization criterion (Ye, 2005). The optimal transformation is computed through 
the simultaneous diagonalization of the scatter matrices, while the singularity problem is overcome 
implicitly. Discriminant analysis with orthogonal transformations has been studied in (Duchene and 
Leclerq, 1988; Foley and Sammon, 1975). By a close examination of the computations involved in 
OLDA, we can decompose the OLDA algorithm into three steps: first remove the null space of the 
total scatter matrix; followed by classical uncorrelated LDA (ULDA), a variant of classical LDA 
(details can be found in Section 2.1); and finally apply an orthogonalization step to the transforma- 
tion. 

Both the NLDA algorithm (Huang et al., 2002) and the OLDA algorithm (Ye, 2005) result in or- 
thogonal transformations. However, they applied different schemes in deriving the optimal transfor- 
mations. NLDA computes an orthogonal transformation in the null space of the within-class scatter 
matrix, while OLDA computes an orthogonal transformation through the simultaneous diagonaliza- 
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tion of the scatter matrices. Interestingly, we show in Section 5 that NLDA is equivalent to OLDA, 
under a mild condition C1,! which holds in many applications involving high-dimensional data (see 
Section 7). Based on the equivalence result, an improved algorithm for NLDA, called iNLDA, is 
presented, which further reduces the computational cost of the original NLDA algorithm. 

We extend the OLDA algorithm by applying the regularization technique, which is commonly 
used to stabilize the sample covariance matrix estimation and improve the classification performance 
(Friedman, 1989). The algorithm is called regularized OLDA (or ROLDA for short). The key idea 
in ROLDA is to add a constant A to the diagonal elements of the total scatter matrix. Here À > 0 
is known as the regularization parameter. Choosing an appropriate regularization value is a critical 
issue in ROLDA, as a large À may significantly disturb the information in the scatter matrix, while 
a small A may not be effective in improving the classification performance. Cross-validation is 
commonly used to estimate the optimal À from a finite set of candidates. Selecting an optimal value 
for a parameter such as À is called model selection (Hastie et al., 2001). The computational cost 
of model selection for ROLDA can be expensive, especially when the candidate set is large, since 
it requires expensive matrix computations for each A. We show in Section 6 that the computations 
in ROLDA can be decomposed into two components: the first component involves matrices of 
high dimensionality but independent of A, while the second component involves matrices of low 
dimensionality. When searching for the optimal À from a set of candidates via cross-validation, we 
repeat the computations involved in the second component only, thus reducing the computational 
cost of model selection in ROLDA. 

We have conducted experiments using 14 data sets from various data sources, including low- 
dimensional data from UCI Machine Learning Repository? and high-dimensional data such as text 
documents, face images, and gene expression data. (Details on these data sets can be found in 
Section 7.) We did a comparative study of NLDA, iNLDA, OLDA, ULDA, ROLDA, and Support 
Vector Machines (SVM) (Schókopf and Smola, 2002; Vapnik, 1998) in classification. Experimental 
results show that 


e For all low-dimensional data sets, the null space of the within-class scatter matrix is empty, 
and both NLDA and iNLDA do not apply. However, OLDA is applicable and the reduced 
dimensionality of OLDA is in general k — 1, where k is the number of classes. Condition C1 
holds for most high-dimensional data sets (eight out of nine data sets). NLDA, iNLDA, and 
OLDA achieve the same classification performance, in all cases when condition C1 holds. For 
cases where condition C1 does not hold, OLDA outperforms NLDA and iNLDA, as OLDA 
has a larger number of reduced dimensions than NLDA and iNLDA. These empirical results 
are consistent with our theoretical analysis. 


e iNLDA and NLDA achieve similar performance in all cases. OLDA is very competitive with 
ULDA. This confirms the effectiveness of the final orthogonalization step in OLDA. ROLDA 
achieves a better classification performance than OLDA, which shows the effectiveness of 
the regularization in ROLDA. Overall, ROLDA and SVM are very competitive with other 
methods in classification. 


The rest of the paper is organized as follows. An overview of classical LDA and classical 
uncorrelated LDA is given in Section 2. NLDA and OLDA are discussed in Section 3 and Section 4, 





1. Condition C1 requires that the rank of the total scatter matrix equals to the sum of the rank of the between-class 
scatter matrix and the rank of the within-class scatter matrix. More details will be given in Section 5. 
2. http://www.ics.uci.edu/~mlearn/MLRepository.html 
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Notation || Description Notation | Description 
A data matrix n number of training data points 
m data dimensionality £ reduced dimensionality 
k number of classes Sp between-class scatter matrix 
Sy within-class scatter matrix St total scatter matrix 
G transformation matrix Si covariance matrix of the i-th class 
Ci centroid of the i-th class Nj sample size of the i-th class 
c global centroid K number of neighbors in K-NN 
t rank of S, q rank of Sp 





Table 1: Notation. 


respectively. The relationship between NLDA and OLDA is studied in Section 5. The ROLDA 
algorithm is presented in Section 6. Section 7 includes the experimental results. We conclude in 
Section 8. 

For convenience, Table 1 lists the important notation used in the rest of this paper. 


2. Classical Linear Discriminant Analysis 


Given a data set consisting of n data points (a 9v in IR", classical LDA computes a linear trans- 


formation G € IR"*4 (/ < m) that maps each a j in the m-dimensional space to a vector â; in the 
(-dimensional space by à; — Gla j- Define three matrices H,,, Hp, and S; as follows: 


Ay = [(A1—cie"),++-, (Ag —cxe” )], (1) 


= 


Hy = ~|vn(ci =c), ++, Vice = c)), (2) 


H = (A — ce’), (3) 


aS 


where A = |[a1,---,d,]| is the data matrix, Aj, Ci, Sj, and n; are the data matrix, the centroid, the 
covariance matrix, and the sample size of the i-th class, respectively, c is the global centroid, k is 
the number of classes, and e is the vector of all ones. Then the between-class scatter matrix Sp, 
the within-class scatter matrix Sw, and the total scatter matrix S, are defined as follows (Fukunaga, 
1990): 


Sy = H,HI, Sp = H,HI , and S, = H,H7. 


It follows from the definition (Ye, 2005) that trace(S,,) measures the within-class cohesion, 
trace(S,) measures the between-class separation, and trace(S,) measures the variance of the data 
set, where the trace of a square matrix is the summation of its diagonal entries (Golub and Van 
Loan, 1996). It is easy to verify that S, = Sp + Sw. In the lower-dimensional space resulting from the 
linear transformation G, the scatter matrices become SL = G'S,,G, SE = GT S,G, and SE = G'S,G. 
An optimal transformation G would maximize trace(S) and minimize trace(SL). Classical LDA 
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aims to compute the optimal G by solving the following optimization problem: 


G — arg max trace ((G's,G) E G's,G) f (4) 
GR": GTS,,G=h 


Other optimization criteria, including those based on the determinant could also be used instead 
(Duda et al., 2000; Fukunaga, 1990). The solution to the optimization problem in Eq. (4) is given 
by the eigenvectors of $;;! S corresponding to the nonzero eigenvalues, provided that the within- 
class scatter matrix Sọ is nonsingular (Fukunaga, 1990). The columns of G form the discriminant 
vectors of classical LDA. Since the rank of the between-class scatter matrix is bounded from above 
by k — 1, there are at most k — 1 discriminant vectors in classical LDA. Note that classical LDA does 
not handle singular scatter matrices, which limits its applicability to low-dimensional data. Several 
methods, including null space LDA and orthogonal LDA subspace LDA, were proposed in the past 
to deal with such singularity problem as discussed in Section 1. 


2.1 Classical Uncorrelated LDA 


Classical uncorrelated LDA (cULDA) is an extension of classical LDA. A key property of CULDA 
is that the features in the transformed space are uncorrelated, thus reducing the redundancy in the 
transformed space. 

cULDA aims to find the optimal discriminant vectors that are S;-orthogonal.? Specifically, 
suppose r vectors 61,0», - - , Q, are obtained, then the (r+ 1)-th vector $,..1 is the one that maximizes 
the Fisher criterion function (Jin et al., 2001): 


O” 5,0 
E OT (5) 
f(Q) $T So 
subject to the constraints: 91,156; =0, fori=1,---,r. 


The algorithm in (Jin et al., 2001) finds the discriminant vectors $;'s successively by solving a 
sequence of generalized eigenvalue problems, which is expensive for large and high-dimensional 
data sets. However, it has been shown (Ye et al., 2004a) that the discriminant vectors of CULDA can 
be computed efficiently by solving the following optimization problem: 


G — arg max trace ((G"s,G) E G's,G) : (6) 
GR”: GTs,G=h 


where G = [61,---,0¢], if there exist £ discriminant vectors in cULDA. Note that in Eq. (6), all 
discriminant vectors in G are computed simultaneously. The optimization problem above is a variant 
of the one in Eq. (4). The optimal G is given by the eigenvectors of S7 !Sp. 


3. Null Space LDA 


(Chen et al., 2000) proposed the null space LDA (NLDA) for dimensionality reduction, where 
the between-class distance is maximized in the null space of the within-class scatter matrix. The 
basic idea behind this algorithm is that the null space of Sẹ may contain significant discriminant 
information if the projection of S; is not zero in that direction (Chen et al., 2000; Lu et al., 2003). 





3. Two vectors x and y are S;-orthogonal, if x7 S; y = 0. 
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The singularity problem is thus overcome implicitly. The optimal transformation of NLDA can be 
computed by solving the following optimization problem: 


G = argmaxgr s,G-otrace(G" SG). (7) 


The computation of the optimal G involves the computation of the null space of Sẹ, which may 
be large for high-dimensional data. Indeed, the dimensionality of the null space of S,, is at least 
m 4- k — n, where m is the data dimensionality, k is the number of classes, and n is the sample size. 
In (Chen et al., 2000), a pixel grouping method was used to extract geometric features and reduce 
the dimensionality of samples, and then NLDA was applied in the new feature space. (Huang et al., 
2002) improved the efficiency of the algorithm in (Chen et al., 2000) by first removing the null space 
of the total scatter matrix S+. It is based on the observation that the null space of S, is the intersection 
of the null space of S, and the null space of Sw, as S; = Sy + Sp. 

We can efficiently remove the null space of S, as follows. Let H, = ULV" be the Singular Value 
Decomposition (SVD) (Golub and Van Loan, 1996) of H;, where H, is defined in Eq. (3), U and V 


are orthogonal, 
(M3 0 
CEN 


Y, € IR" is diagonal with the diagonal entries sorted in the non-increasing order, and t = rank(S; ). 
Then 


2 
S, HH] = UXy'vy'u* -uxr'y' =U ( A : ) U7. (8) 
Let U = (U1,U2) be a partition of U with U; € IR”” and Uz € IR”). Then the null space of S; 
can be removed by projecting the data onto the subspace spanned by the columns of Uj. Let Sp, Sw, 
and $, be the scatter matrices after the removal of the null space of S,. That is, 


Sy 2ULSQU;, Sy = ULSSU;, and $, = U7 S,U;. 


Note that only U; is involved for the projection. We can thus apply the reduced SVD computation 
(Golub and Van Loan, 1996) on H, with the time complexity of O(mn?), instead of O(m?n). When 
the data dimensionality m is much larger than the sample size n, this leads to a big reduction in 
terms of the computational cost. 

With the computed Ui, the optimal transformation of NLDA is given by G = UN, where N is 
obtained by solving the following optimization problem: 


N = argmaxyrs. y—otrace(N TS,N). (9) 


That is, the columns of N lie in the null space of $,, while maximizing trace(NT N). 

Let W be the matrix so that the columns of W span the null space of $,. Then N — WM, for 
some matrix M, which is to be determined next. Since the constraint in Eq. (9) is satisfied with 
N = WM for any M, the optimal M can be computed by maximizing 

trace(M" WT $WM). 


By imposing the orthogonality constraint on M (Huang et al., 2002), the optimal M is given by the 
eigenvectors of WT W corresponding to the nonzero eigenvalues. With the computed U1, W, and 
M above, the optimal transformation of NLDA is given by 


G=U\WM. 
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Algorithm 1: NLDA (Null space LDA) 

Input: data matrix A 

Output: transformation matrix G 

1. Form the matrix H, as in Eq. (3); 

2. Compute the reduced SVD of H, as H, = U; EV: 

3. Form the matrices $5 = UT SQU| and $,, = UT S„U1; 

4. Compute the null space, W, of $,, via the eigen-decomposition; 

5. Construct the matrix M, consisting of the top eigenvectors of WT SW; 
6. G— UWM. 











In (Huang et al., 2002), the matrix W is computed via the eigen-decomposition of $,. More 
specifically, let 


$,- wu dew ) w, WI 


be its eigen-decomposition, where [W, W] is orthogonal and A,, is diagonal with positive diagonal 
entries. Then W forms the null space of $,. The pseudo-code for the NLDA algorithm is given in 
Algorithm 1. 


4. Orthogonal LDA 


Orthogonal LDA (OLDA) was proposed in (Ye, 2005) as an extension of classical LDA. The dis- 
criminant vectors in OLDA are orthogonal to each other. Furthermore, OLDA is applicable even 
when all scatter matrices are singular, thus overcoming the singularity problem. It has been applied 
successfully in many applications, including document classification, face recognition, and gene 
expression data classification. The optimal transformation in OLDA can be computed by solving 
the following optimization problem: 


G = argmax pm trace ((G^ SG) * G' SG) , (10) 


:GT G-I; 


where M * denotes the pseudo-inverse of matrix M (Golub and Van Loan, 1996). The orthogonality 

condition is imposed in the constraint. The computation of the optimal transformation of OLDA is 

based on the simultaneous diagonalization of the three scatter matrices as follows (Ye, 2005). 
From Eq. (8), U2 lies in the null space of both S, and Sẹ. Thus, 


T T 
U'S,U = ( Ui ab : ) UTS,U = ( d a : i; (11) 


Denote B = X LU H, and let B = PXQ’ be the SVD of B, where P and Q are orthogonal and Y is 
diagonal. Define the matrix X as 


E XP 0 
x=u( i LA (12) 


It can be shown (Ye, 2005) that X simultaneously diagonalizes Sp, Sw, and S+. That is 


X'S,X = Dy, X! S,X = Dy, and X! SX = D,, (13) 


1189 


YE AND XIONG 





Algorithm 2: OLDA (Orthogonal LDA) 

Input: data matrix A 

Output: transformation matrix G 

1. Compute U1, X, and P; 

2. X, — UIE | P,, where q = rank(S;); 

3. Compute the QR decomposition of X, as X, = QR; 
4. G — Q. 











where Dp, Dwy, and D, are diagonal with the diagonal entries in D; sorted in the non-increasing order. 
The main result in (Ye, 2005) has shown that the optimal transformation of OLDA can be computed 
through the orthogonalization of the columns in X, as summarized in the following theorem: 


Theorem 4.1 Let X be the matrix defined in Eq. (12) and let X, be the matrix consisting of the first 
q columns of X, where q = rank(Sp). Let X, = QR be the QR-decomposition of X}, where Q has 
orthonormal columns and R is upper triangular. Then G — Q solves the optimization problem in 
Eq. (10). 


From Theorem 4.1, only the first q columns of X are used in computing the optimal G. From 
Eq. (12), the first q columns of X are given by 


X; = UIE P}, (14) 


where P, consists of the first g columns of the matrix P. We can observe that U; corresponds to the 
removal of the null space of S; as in NLDA, while X; 1P; is the optimal transformation when clas- 
sical ULDA is applied to the intermediate (dimensionality) reduced space by the projection of U4. 
The OLDA algorithm can thus be decomposed into three steps: (1) Remove the null space of S;; (2) 
Apply classical ULDA as an intermediate step, since the reduced total scatter is nonsingular; and (3) 
Apply an orthogonalization step to the transformation, which corresponds to the QR decomposition 
of X, in Theorem 4.1. The pseudo-code for the OLDA algorithm is given in Algorithm 2. 


Remark 1 The ULDA algorithm in (Ye et al., 2004a; Ye, 2005) consists of steps 1 and 2 above, 
without the final orthogonalization step. Experimental results in Section 7 show that OLDA is com- 
petitive with ULDA. The rationale behind this may be that ULDA involves the minimum redundancy 
in the transformed space and is susceptible to overfitting; OLDA, on the other hand, removes the 
R matrix through the QR decomposition in the final orthogonalization step, which introduces the 
redundancy in the reduced space, but may be less susceptible to overfitting. 


5. Relationship Between NLDA and OLDA 


Both the NLDA algorithm and the OLDA algorithm result in orthogonal transformations. Our 
empirical results show that they often lead to similar performance, especially for high-dimensional 
data. This implies there may exist an intrinsic relationship between these two algorithms. In this 
section, we take a closer look at the relationship between NLDA and OLDA. More specifically, we 
show that NLDA is equivalent to OLDA, under a mild condition 


C1 : rank(S;) = rank(Sp) + rank(S,,), (15) 
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which holds in many applications involving high-dimensional data (see Section 7). It is easy to 
verify from the definition of the scatter matrices that rank(S,) < rank(S,) + rank(S,,). 
From Eqs. (8) and (11), the null space, U2, of S, can be removed, as follows: 


5, = UÏ SU; = UT SU, +0] SU, =8, + Sp E R^. 


Since the null space of S, is the intersection of the null space of S; and the null space of Sẹ, the 
following equalities hold: 


rank(S;) = rank(S;) =t, rank(S,) = rank(S;), and rank(S,,) = rank(S,,). 
Thus condition C1 is equivalent to 
rank(S;) = rank(S,) +rank(S,,). 


The null space of S$; and the null space of $,, are critical in our analysis. The relationship between 
these two null spaces is studied in the following lemma. 


Lemma 5.1 Let Š, $5, and $,, be defined as above and t = rank(S;). Let (wi,---,w,] forms an 
orthonormal basis for the null space of Sy, and let {b1,---,bs} forms an orthonormal basis for the 
null space of Sp. Then, (wi, -- wy, bi, - , b, are linearly independent. 


Proof Prove by contradiction. Assume there exist &;’s and D 's, not all zeros, such that 


r s 
Yaw; + Y Bjb; =0. 
i=l FE 


It follows that 


T ; 5 
0= [Zor Eno) Su (Zamir È Be) = (Eso) 
i-l j=l i=l j=l J=l 


M 
Sw (È 2 , 
j=l 
since w;'s lie in the null space of S,,. Hence, 


2 p ^) E (Y p 2 = (E 2) = (E 2) + È 2) : Si (E 2) 


= 0. 


T 


Since S; is nonsingular, we have Ya Bjb; =0. Thus D; = 0, for all j, since {b),---,bs} forms an 
orthonormal basis for the null space of Sp. 
Similarly, we have 


T 
0= [Zam 2: 2) Si | 
=I jal 


( 


T 
s r r 
Qiwi + Y ws) = ( vm Sp (E v : 
1 j=l i=l i=1 


T T T 
r r r r 

2 S; (É 2 = (E we Sw (£ ww + | 2 Sp p 2 
1 i=l icd i=1 i=l i=l 


= 0. 


Ton 


and 


ims 
ires 
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Hence )';.., jw; = 0, and a; = 0, for all i, since {w,,---,w,} forms are orthonormal basis for the 
null space of S,,. This contradicts our assumption that not all of the a;’s and the B;’s are zero, Thus, 
[wi,: we bis, b,) are linearly independent. E 


Next, we show how to compute the optimal transformation of NLDA using these two null 
spaces. Recall that in NLDA, the null space of S, may be removed first. In the following dis- 
cussion, we work on the reduced scatter matrices $,,, $5, and S, directly as in Lemma 5.1. The main 
result is summarized in the following theorem. 


Theorem 5.1 Let U;, S, Sp, and S, be defined as above and t = rank(S;). Let R = [W,B], where 
W = Wwi,:-,w;] B= [b1,---, Ds], and {w1,:--,wr,b1,: +- ,bs} are defined as in Lemma 5.1. Assume 
that condition C1: rank(S;) = rank(Sp) + rank(S,,) holds. Then G = U,WM solves the optimization 
problem in Eq. (9), where the matrix M, consisting of the eigenvectors of WTS,W, is orthogonal. 


Proof From Lemma 5.1, {w1,---,w,,b1,-:+,bs} € IR’ is linearly independent. Condition C1 implies 
that t = r+ s. Thus {w,---,w,,b1,---,bs} forms a basis for IR‘, that is, R = [W, B] is nonsingular. 
It follows that 


R'S,R -- RT S,R 
WTS,w WTS,B WTS,W WTS,B 
B'SQW | Bl'S,B B'S,W  B’S,,B 


WTSw 0 0 0 
= $ Ks ] 
0 0 0 B'S,B 
Since matrix RT S,R has full rank, WT $,W, the projection of Š, onto the null space of Š, is non- 
singular. Let WT SW = MA,MT be the eigen-decomposition of W7S,W, where M is orthogonal 


and A, is diagonal with positive diagonal entries (note that WT 5,W is positive definite). Then, from 
Section 3, the optimal transformation G of NLDA is given by G = UiW M. a 


R'S,R 


Recall that the matrix M in NLDA is computed so that trace(MT WT $,W M) is maximized. Since 
trace(QAQ' ) = trace(A) for any orthogonal Q, the solution in NLDA is invariant under an arbitrary 
orthogonal transformation. Thus G = UW is also a solution to NLDA, since M is orthogonal, as 
summarized in the following corollary. 


Corollary 5.1 Assume condition C1: rank(S;) = rank(S,) + rank(S,,) holds. Let U; and W be 
defined as in Theorem 5.1. Then G = U\W solves the optimization problem in Eq. (9). That is, 
G = UW is an optimal transformation of NLDA. 


Corollary 5.1 implies that when condition C1 holds, Step 5 in Algorithm 1 may be removed, 
as well as the formation of $, in Step 3 and the multiplication of UW with M in Step 6. This 
improves the efficiency of the NLDA algorithm. The improved NLDA (NLDA) algorithm is given 
in Algorithm 3. Note that it is recommended in (Liu et al., 2004) that the maximization of the 
between-class distance in Step 5 of Algorithm 1 should be removed to avoid possible overfitting. 
However, Corollary 5.1 shows that under condition C1, the removal of Step 5 has no effect on the 
performance of the NLDA algorithm. 

Next, we show the equivalence relationship between NLDA and OLDA, when condition C1 
holds. The main result is summarized in the following theorem. 
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Algorithm 3: iNLDA (improved NLDA) 

Input: data matrix A 

Output: transformation matrix G 

1. Form the matrix H, as in Eq. (3); 

2. Compute the reduced SVD of H, as H, = U; EMI. 

3. Construct the matrix $,, = UTS Ui; 

4. Compute the null space, W, of $,,, via the eigen-decomposition; 
5. G — UW. 











Theorem 5.2 Assume that condition C1: rank(S;) = rank(S,) + rank(Sy) holds. Let U; and W be 
defined as in Theorem 5.1. Then, G = UW solves the optimization problem in Eq. (10). That is, 
under the given assumption, OLDA and NLDA are equivalent. 


Proof Recall that the optimization involved in OLDA is 
G = argmax ppm, 5, ; trace (EASE) (16) 


where SŁ = G'S,G and SE = G'S,G. From Section 4, the maximum number, @, of discriminant 
vectors is no larger than q, which is the rank of Sp. Recall that 


q = rank(S,) = rank(S,) = rank(S;) — rank($,,.) = r, 


where r is the dimension of the null space of S,,. 
Based on the property of the trace of matrices, we have 


trace (Sr) Sf) + trace (5585) — trace (GS) 887) = rank(SP) < q =r, 


where the second equality follows since trace (ATA) = rank(A) for any square matrix A, and the 
inequality follows since the rank of S^ € IR? is at most £ < q. 

It follows that trace ((S7) * Sz) < r, since trace ((S7) * SL), the trace of the product of two positive 
semi-definite matrices, is always nonnegative. Next, we show that the maximum is achieved, when 
G —UW. 

Recall that the dimension of the null space, W, of S, is r. That is, W € IR", It follows that 
(U1W)T S, (U1W) € IR", and rank((U1W)? S (U1W)) = r. Furthermore, 


(U1W)  S,(UW) — W' S,W — 0, 
as W forms the null space of $,,. It follows that, 
trace (Wiw) s uw) (UW)"S,(UW)) - 0. 
Hence, 
trace ((u wy s. uiw))* (Ui W)'sUW)) = rank ((UW)" S (U1W)) 


— trace ((uW)'s.quiw))* ((UiW)?S.(UiW)) J =r. 
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Thus G = U,W solves the optimization problem in Eq. (10). That is, OLDA and NLDA are equiva- 
lent. a 


Theorem 5.2 above shows that under condition C1, OLDA and NLDA are equivalent. Next, we 
show that condition C1 holds when the data points are linearly independent as summarized below. 


Theorem 5.3 Assume that condition C2, that is, the n data points in the data matrix A € IR™*" are 
linearly independent, holds. Then condition C1: rank(S,) = rank(S,) + rank(S,,) holds. 


Proof Since the n columns in A are linearly independent, H, = A — cel is of rank n — 1. That is, 
rank(S;) 2 n — 1. Next we show that rank($5) = k — 1 and rank(Sw) = n — k. Thus condition C1 
holds. 

It is easy to verify that rank(S5) < k — 1 and rank(S,,) < n — k. We have 


n — 1 — rank(S;) € rank(S,) +rank(S,) € (k—1) 4-(n—k) 2n— 1. (17) 

It follows that all inequalities in Eq. (17) become equalities. That is, 
rank(S,) — k — 1, rank(S,,) 2 n — k, and rank(S,) = rank(S5) + rank(S,,). (18) 
Thus, condition C1 holds. E 


Our experimental results in Section 7 show that for high-dimensional data, the linear indepen- 
dence condition C2 holds in many cases, while condition C1 is satisfied in most cases. This explains 
why NLDA and OLDA often achieve the same performance in many applications involving high- 
dimensional data, such as text documents, face images, and gene expression data. 


6. Regularized Orthogonal LDA 


Recall that OLDA involves the pseudo-inverse of the total scatter matrix, whose estimation may not 
be reliable especially for undersampled data, where the number of dimensions exceeds the sample 
size. In such case, the parameter estimates can be highly unstable, giving rise to high variance. 
By employing a method of regularization, one attempts to improve the estimates by regulating this 
bias variance trade-off (Friedman, 1989). We employ the regularization technique to OLDA by 
adding a constant À to the diagonal elements of the total scatter matrix. Here 1 > 0 is known as 
the regularization parameter. The algorithm is called regularized OLDA (ROLDA). The optimal 
transformation, G”, of ROLDA can be computed by solving the following optimization problem: 


G” = argmax c st trace ((c (S-ExLOG)" G'$,G) (19) 


The optimal G" can be computed by solving an eigenvalue problem as summarized in the following 
theorem (The proof follows Theorem 3.1 in (Ye, 2005) and is thus omitted): 


Theorem 6.1 Let X; be the matrix consisting of the first q eigenvectors of the matrix 
(S, + Amn) tSp (20) 


corresponding to the nonzero eigenvalues, where q = rank(Sy). Let X; = QR be the QR-decomposition 
of X», where Q has orthonormal columns and R is upper triangular. Then G = Q solves the opti- 
mization problem in Eq. (19). 
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Theorem 6.1 implies that the main computation involved in ROLDA is the eigen-decomposition of 
the matrix (S, + Min) | Sp. Direct formation of the matrix is expensive for high-dimensional data, 
as it is of size m by m. In the following, we present an efficient way of computing the eigen- 
decomposition. Denote 

BY = (X2 An)? UT H, Q1) 


and let 
B= PS (22) 
be the SVD of B”. From Eqs. (8) and (11), we have 


2 -1 T. 





0 AC ae 0 0 
— yl QM) UBHU 0 \ yr 
0 0 
o gh QM) PE(BY GEM)? 0) ar 
0 0 
x u( (Z2)? Pr (ET (P) G2)? 0 jur 
i 0 0 l 


It follows that the columns of the matrix 
U(X? +A) PP 


form the eigenvectors of (S, + Am) !Sp corresponding to the top q nonzero eigenvalues, where P; 
denotes the first q columns of P". That is, X7 in Theorem 6.1 is given by 


X; SU (2 Aj) P PF. (23) 


The pseudo-code for the ROLDA algorithm is given in Algorithm 4. The computations in ROLDA 
can be decomposed into two components: the first component involves the matrix, U} € IR", of 
high dimensionality but independent of A, while the second component involves the matrix, 


(E? +A) PPI e R”, 


of low dimensionality. When we apply cross-validation to search for the optimal À from a set of 
candidates, we repeat the computations involved in the second component only, thus making the 
computational cost of model selection small. 
More specifically, let 
A= Qa, Aa} Q4) 


be the candidate set for the regularization parameter A, where |A| denotes the size of the candidate 
set A. We apply v-fold cross-validation for model selection (we choose v — 5 in our experiment), 
where the data is divided into v subsets of (approximately) equal size. All subsets are mutually 
exclusive, and in the i-th fold, the i-th subset is held out for testing and all other subsets are used for 
training. For each A; (j = 1,---,|A|), we compute the cross-validation accuracy, Accu(j), defined 
as the mean of the accuracies for all folds. The optimal regularization value Àj» is the one with 


j* = argmax Accu( j). (25) 
j 
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Algorithm 4: ROLDA (Regularized OLDA) 

Input: data matrix A and regularization value À 
Output: transformation matrix G” 

1. Compute U1, X;, and Pj, where q = rank(S; ); 

2. X} — U (EP + Ay) PF; 

3. Compute the QR decomposition of X; as X; = QR; 
4. G' — Q. 











The K-Nearest Neighbor algorithm with K — 1, called 1-NN, is used for computing the accuracy. 
The pseudo-code for the model selection procedure in ROLDA is given in Algorithm 5. Note that 
we apply the QR decomposition to 


(Zp +A.) ^ P. e R” (26) 
instead of 
X; = U (2 +A) P e ma, (27) 
as done in Theorem 6.1, since U; has orthonormal columns. 


Algorithm 5: Model selection for ROLDA 
Input: data matrix A and candidate set A = {A1,---, Ajay} 
Output: optimal regularization value Àj: 
1. Fori=1:v /* y-fold cross-validation */ 
2. Construct A! and A’ : 

[* A! — i-th fold, for training and A! — rest, for testing */ 
3. Construct H, and H; using A’ as in Eqs. (2) and (3), respectively; 
4. Compute the reduced SVD of H, as H, = U£, Vf ; t — rank(H;); 
5. Api UT Hp, q — rank(H,); 
6 
7 
8 











Ai — UTA’; Ai, — UTA’; /* Projection by U, */ 
For j-1:|A| Æ di Ma */ 
! Dj — (2X4) 7. Br — DH, y 
9. Compute the SVD of B" as B" = P'$7(Q")T ; 
10. D,.p — DjP;; Compute the QR decomposition of D, p as D;.p = QR; 
11. Ai — QTAi ; Ai — QT AI; 
12. Run 1-NN on (aiai) and compute the accuracy, denoted as Accu(i, j); 
13. EndFor 
14. EndFor 
15. Accu(j) — +£}; Accu(i, j); 
16. j* — argmax; Accu( j); 
17. Output Àj» as the optimal regularization value. 





6.1 Time Complexity 


We conclude this section by analyzing the time complexity of the model selection procedure de- 
scribed above. 
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Line 4 in Algorithm 5 takes O(r?m) time for the reduced SVD computation. Lines 5 and 6 
take O(mtk) = O(mnk) and O(tmn) = O(mn?) time, respectively, for the matrix multiplications. 
For each Aj, for j = 1,---,|AJ, of the "For" loop, Lines 9 and 10 take O(tk*) = O(nk?) time for the 
SVD and QR decomposition and matrix multiplication. Line 11 takes O(ktn) = O(kr?) time for 
the matrix multiplication. The computation of the classification accuracy by 1-NN in Line 12 takes 
O(nk/v) time, as the size of the test set, A^, is about n/v. Thus, the time complexity, T (|A|), of the 
model selection procedure is 


T(JA) = O(v (nó m 4- mr? +mnk + |A| (nk? + kn? +n?k/v))) à 


For high-dimensional and undersampled data, where the sample size, n, is much smaller than the 
dimensionality m, the time complexity is simplified to 


T(|A|) = O(v(n?m+ |A\n7k)) =O (wm ( + IA) 


When the number, k, of classes in the data set is much smaller than the dimensionality, m, the over- 
head of estimating the optimal regularization value among a large candidate set may be small. Our 
experiments on a collection of high-dimensional and undersampled data (see Section 7) show that 
the computational cost of the model selection procedure in ROLDA grows slowly as |A| increases. 


7. Experimental Studies 


In this section, we perform extensive experimental studies to evaluate the theoretical results and the 
ROLDA algorithm presented in this paper. Section 7.1 describes our test data sets. We perform a 
detailed comparison of NLDA, iNLDA, and OLDA in Section 7.2. Results are consistent with our 
theoretical analysis. In Section 7.3, we compare the classification performance of NLDA, iNLDA, 
OLDA, ULDA, ROLDA, and SVM. The K-Nearest-Neighbor (K-NN) algorithm with K — 1 is used 
as the classifier for all LDA based algorithms. 


7.1 Data Sets 


We used 14 data sets from various data sources in our experimental studies. The statistics of our test 
data sets are summarized in Table 2. 

The first five data sets, including spambase,* balance, wine, waveform, and vowel, are low- 
dimensional data from the UCI Machine Learning Repository. The next nine data sets, including 
text documents, face images, and gene expression data, have high dimensionality: rel, re0, and 
tr41 are three text document data sets, where rel and reO are derived from Reuters-21578 text 
categorization test collection Distribution 1.0? and tr41 is derived from the TREC-5, TREC-6, 
and TREC-7 collections; ORL,’ AR,® and PIX? are three face image data sets; GCM, colon, and 
ALLAML4 are three gene expression data sets (Ye et al., 2004b). 





. Only a subset of the original spambase data set is used in our study. 
. http://www.daviddlewis.com/resources/testcollections/reuters2 1578/ 
. http://trec.nist.gov 

. http://www.uk.research.att.com/facedatabase.html 

. http://rvl1.ecn.purdue.edu/~aleix/aleix_face_DB.html 

. http://peipa.essex.ac.uk/ipa/pix/faces/manchester/test-hard/ 


wo o0o-1o]g tA RR; 
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Data Set Sample size (n) # of dimensions | # of classes 
training test total (m) (k) 
spambase 400 600 1000 56 2 
balance 416 209 625 4 3 
wine 118 60 178 13 3 
waveform 300 500 800 21 3 
vowel 528 462 990 10 11 
rel — — 490 3759 5 
re0 — — 320 2887 4 
tr41 — — 210 7454 7 
ORL — — 400 10304 40 
AR — — 650 8888 50 
PIX — — 300 10000 30 
GCM — — 198 16063 14 
colon — — 62 2000 2 
ALLAML4 — — 72 7129 4 























Table 2: Statistics of our test data sets. For the first five data sets, we used the given partition of 
training and test sets, while for the last nine data sets, we did random splittings into training 
and test sets of ratio 2:1. 


7.2 Comparison of NLDA, iNLDA, and OLDA 


In this experiment, we did a comparative study of NLDA, iNLDA, and OLDA. For the first five 
low-dimensional data sets from the UCI Machine Learning Repository, we used the given splitting 
of training and test sets. The result is summarized in Table 3. For the next nine high-dimensional 
data sets, we performed our study by repeated random splittings into training and test sets. The data 
was partitioned randomly into a training set, where each class consists of two-thirds of the whole 
class and a test set with each class consisting of one-third of the whole class. The splitting was 
repeated 20 times and the resulting accuracies of different algorithms for the first ten splittings are 
summarized in Table 4. Note that the mean accuracy for the 20 different splittings will be reported 
in the next section. The rank of three scatter matrices, Sp, Sw, and S+, for each of the splittings is 
also reported. 
The main observations from Table 3 and Table 4 include: 


e For the first five low-dimensional data sets, we have rank(S,) = k — 1, and rank(S,,) = 
rank(S;) = m, where m is the data dimensionality. Thus the null space of $,, is empty, and 
both NLDA and iNLDA do not apply. However, OLDA is applicable and the reduced dimen- 
sionality of OLDA is k — 1. 


e For the next nine high-dimensional data sets, condition C1: rank(S;) = rank(S,) +rank(S,,) is 
satisfied in all cases except the reO data set. For the reO data set, either rank(S,) = rank(S5) + 
rank(S,,) or rank(S,) = rank(S,) + rank(S,,) — 1 holds, that is, condition C1 is not severely 
violated for re0. Note that reO has the smallest number of dimensions among the nine high- 
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Data Set 
spambase balance wine waveform vowel 
NLDA — — — — — 
Method iNLDA — 
OLDA 88.17 86.60 98.33 73.20 56.28 























Sp 1 2 2 2 10 
Rank Sw 56 4 13 21 10 
S; 56 4 13 21 10 





Table 3: Comparison of NLDA, iNLDA, and OLDA on classification accuracy (in percentage) us- 
ing five low-dimensional data sets from UCI Machine Learning Repository. The ranks of 
three scatter matrices are reported. 


dimensional data sets. From the experiments, we may infer that condition C1 is more likely 
to hold for high-dimensional data. 


e NLDA, iNLDA, and OLDA achieve the same classification performance in all cases when 
condition C1 holds. The empirical result confirms the theoretical analysis in Section 5. This 
explains why NLDA and OLDA often achieve similar performance for high-dimensional data. 
We can also observe that NLDA and iNLDA achieve similar performance in all cases. 


e The numbers of training data points for the nine high-dimensional data (in the same order as 
in the table) are 325, 212, 140, 280, 450, 210, 125, 68, and 48, respectively. By examining 
the rank of S, in Table 4, we can observe that the training data in six out of nine data sets, 
including tr41, ORL, AR, GCM, colon, and ALLAML4, are linearly independent. That is, the 
independence assumption C2 from Theorem 5.3 holds for these data sets. It is clear from the 
table that for these six data sets, condition C1 holds and NLDA, iNLDA, and OLDA achieve 
the same performance. These are consistent with the theoretical analysis in Section 5. 


e For the reO data set, where condition C1 does not hold, i.e., rank(S;) < rank(S,) +rank(Sw), 
OLDA achieves higher classification accuracy than NLDA and iNLDA. Recall that the re- 
duced dimensionality of OLDA equals rank(S,) = q. The reduced dimensionality in NLDA 
and iNLDA equals the dimension of the null space of $,,, which equals rank(S;) — rank(S,,) < 
rank(S,). That is, OLDA keeps more dimensions in the transformed space than NLDA and 
iNLDA. Experimental results in re0 show that these extra dimensions used in OLDA improve 
its classification performance. 


7.3 Comparative Studies on Classification 


In this experiment, we conducted a comparative study of NLDA, iNLDA, OLDA, ULDA, ROLDA, 
and SVM in terms of classification. For ROLDA, the optimal À is estimated through cross-validation 
on a candidate set, A = {À nue Recall that T(|A|) denotes the computational cost of the model 
selection procedure in ROLDA, where |A| is the size of the candidate set of the regularization values. 
We have performed model selection on all nine high-dimensional data sets using different values of 
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Data Set Method Ten different splittings into training and test sets of ratio 2:1 


NLDA | 92.73 93.33 03.33 93.94 9455 95.15 96.36 95.15 92.12 93.94 
iNLDA | 92.73 93.33 093.33 93.94 94.55 95.15 96.36 95.15 92.12 93.94 








rel OLDA | 92.73 93.33 93.33 93.94 9455 95.15 96.36 95.15 92.12 93.94 
Sp 4 4 4 4 4 4 4 4 4 4 
Sw 316 318 319 316 316 320 316 318 317 318 
S; 320 322 323 320 320 324 320 322 321 322 





NLDA | 64.81 62.04 64.81 68.52 87.96 70.37 71.30 73.5 87.04 75.93 
iNLDA | 65.74 62.04 64.81 6944 87.96 70.37 71.30 72.22 87.04 75.93 


reO OLDA | 75.93 75.00 77.78 74.07 87.96 80.56 74.07 78.70 87.04 79.63 
Sp 3 3 3 3 3 3 3 3 3 3 
Sw 205 204 203 203 205 204 201 203 203 205 
S; 207 206 205 205 208 206 203 205 206 207 





NLDA | 97.4 95.71 97.14 98.57 97.14 98.57 100.0 095.71 98.57 95.71 
iNLDA | 97.14 95.71 97.14 98.57 97.14 98.57 100.0 95.71 98.57 95.71 


tr41 OLDA | 97.14 95.71 97.14 98.57 97.14 98.57 100.0 95.71 98.57 95.71 
Sp 6 6 6 6 6 6 6 6 6 6 
Sw 133 133 133 133 133 133 133 133 133 133 
S; 139 139 139 139 139 139 139 139 139 139 





NLDA | 99.17 96.67 98.33 98.33 95.00 95.83 98.33 97.50 98.33 95.83 
iNLDA | 99.17 96.67 98.33 98.33 95.00 95.83 98.33 97.50 98.33 95.83 


ORL OLDA | 99.17 96.67 98.33 98.33 95.00 95.83 98.33 97.50 98.33 95.83 
Sp 39 39 30 39 39 39 39 39 39 39 
Sw 240 240 240 240 240 240 240 240 240 240 
S; 279 279 279 279 279 279 279 279 279 279 





NLDA | 96.50 94.50 9650 94.00 93.50 94.50 93.50 97.00 94.00 96.00 
iNLDA | 96.50 94.50 96.50 94.00 93.50 94.50 93.50 97.00 94.00 96.00 


AR OLDA | 96.50 94.50 96.50 94.00 93.50 94.50 93.50 97.00 94.00 96.00 
Sp 49 49 49 49 49 49 49 49 49 49 
Sw 400 400 400 400 400 400 400 400 400 400 
5, 449 449 449 449 449 449 449 449 449 449 





NLDA | 98.80 97.78 98.89 97.78 98.89 98.89 98.89 97.78 98.89 97.78 
iNLDA | 98.89 97.78 98.89 97.78 98.89 98.89 98.89 97.78 98.890 97.78 


PIX OLDA | 98.80 97.78 98.80 97.78 98.89 98.89 98.80 97.78 98.89 97.78 
Sp 29 29 29 29 29 29 29 29 29 29 
Sw 178 179 179 179 178 180 179 179 180 178 
S; 207 208 208 208 207 209 208 208 209 207 





NLDA | 81.54 80.00 81.54 83.08 84.62 87.69 75.38 78.46 84.62 83.08 
iNLDA | 81.54 80.00 81.54 83.08 8462 87.69 75.38 78.46 84.62 83.08 


GCM OLDA | 81.54 80.00 81.54 83.08 84.62 87.69 75.38 78.46 84.62 83.08 
Sp 13 13 13 13 13 13 13 13 13 13 
Sw 111 111 111 111 111 111 111 111 111 111 
S; 124 124 124 124 124 124 124 124 124 124 





NLDA | 91.18 94.12 100.0 97.06 291.18 91.18 97.06 94.12 94.12 97.06 
iNLDA | 91.18 94.12 100.0 97.06 91.18 91.18 97.06 94.12 94.12 97.06 


colon OLDA | 91.18 94.12 100.0 97.06 91.18 91.18 97.06 94.12 9412 97.06 
Sp 1 1 1 1 1 1 1 1 1 1 
Sw 66 66 66 66 66 66 66 66 66 66 
S; 67 67 67 67 67 67 67 67 67 67 





NLDA | 95.83 91.67 95.83 95.83 87.50 95.83 95.83 100.0 91.67 95.83 
iNLDA | 95.83 91.67 95.83 95.83 87.50 95.83 95.83 100.0 91.67 95.83 
ALLAML4 OLDA | 95.83 91.67 95.83 95.83 87.50 95.83 95.83 100.0 91.67 95.83 

















So 3 3 3 3 3 3 3 3 3 3 
Sis 44 44 44 44 44 44 44 44 44 44 
Sı 47 47 47 47 47 47 47 47 47 47 





Table 4: Comparison of classification accuracy (in percentage) for NLDA, iNLDA, and OLDA us- 
ing nine high-dimensional data sets. Ten different splittings into training and test sets of 
ratio 2:1 (for each of the k classes) are applied. The rank of three scatter matrices for each 
splitting is reported. 
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Data Set NLDA iNLDA OLDA ULDA ROLDA SVM 

rel 94.33 (1.72) 94.33 (1.72) 94.33 (1.72) 94.76 (1.67) 94.79 (1.64) 94.54 (1.88) 
reO 74.03 (9.22) 74.15 (8.19) 79.54 (4.73) 79.72 (4.82) 85.79 (3.66) — 85.87 (3.34) 
tr41 97.00 (2.01) 97.00 (2.01) 97.00 (2.01) 97.14 (2.02) 97.17 (2.04) 97.14 (2.01) 
ORL 97.29 (1.79) | 97.29 (1.79) 97.29 (1.79) 92.75 (1.82) 97.52 (1.64) 97.55 (1.34) 
AR 95.42 (1.30) 95.42 (1.30) 95.42 (1.30) 94.37 (1.46) 97.30 (1.32) — 95.75 (1.43) 
PIX 98.22 (1.41) 98.22 (1.41) 98.22 (1.41) 96.61 (1.92) 98.29 (1.32) — 98.50 (1.24) 
GCM 81.77 (3.61) 81.77 (3.61) 81.77 (3.61) 80.46 (3.71) 82.69 (3.42) 75.31 (4.45) 
Colon 86.50 (5.64) 86.50 (5.64) 86.50 (5.64) 86.50 (5.64) 87.00 (6.16) 87.25 (5.25) 
ALLAML4 || 93.54 (3.70) 93.54 (3.70) 93.54 (3.70) 93.75 (3.45) 93.75 (3.45) 93.70 (3.40) 

















Table 5: Comparison of classification accuracy (in percentage) for six different methods: NLDA, 
iNLDA, OLDA, ULDA, ROLDA, and SVM using nine high-dimensional data sets. The 
mean accuracy and standard deviation (in parenthesis) from 20 different runs are reported. 


|A|. We have observed that T (|A|) grows slowly as |A| increases, and the ratio, T (1024) /T (1), on 
all nine data sets ranges from 1 to 5. Thus, we can run model selection using a large candidate set 
of regularization values, without dramatically increasing the cost. In the following experiments, we 
apply model selection to ROLDA with a candidate set of size |A| = 1024, where 


Àj = 0;/(1— 0), (28) 


with {a Ja uniformly distributed between 0 and 1. As for SVM, we employed the cross-validation 
to estimate the optimal parameter using a candidate set of size 50. To compare different classifi- 
cation algorithms, we applied the same experimental setting as in Section 7.2. The splitting into 
training and test sets of ratio 2:1 (for each of the k classes) was repeated 20 times. The final accu- 
racy reported was the average of the 20 different runs. The standard deviation for each data set was 
also reported. The result on the nine high-dimensionality data sets is summarized in Table 5. 

As observed in Section 7.2, OLDA has the same performance as NLDA and iNLDA in all 
cases except the reO data set, while NLDA and iNLDA achieve similar performance in all cases. 
Overall, ROLDA and SVM are very competitive with other methods. SVM performs well in all 
cases except GCM. The poor performance of SVM in GCM has also been observed in (Li et al., 
2004). ROLDA outperforms OLDA for re0, AR, and GCM, while it is comparable to OLDA for 
all other cases. This confirms the effectiveness of the regularization applied in ROLDA. Note that 
from Remark 1, ULDA is closely related to OLDA. However, unlike OLDA, ULDA does not apply 
the final orthogonalization step. Experimental result in Table 5 confirms the effectiveness of the 
orthogonalization step in OLDA, especially for three face image data sets and GCM. 


8. Conclusions 


In this paper, we present a computational and theoretical analysis of two LDA based algorithms, 
including null space LDA and orthogonal LDA. NLDA computes the discriminant vectors in the 
null space of the within-class scatter matrix, while OLDA computes a set of orthogonal discrimi- 
nant vectors via the simultaneous diagonalization of the scatter matrices. They have been applied 
successfully in many applications, such as document classification, face recognition, and gene ex- 
pression data classification. 


1201 


YE AND XIONG 


Both NLDA and OLDA result in orthogonal transformations. However, they applied different 
schemes in deriving the optimal transformation. Our theoretical analysis in this paper shows that un- 
der a mild condition C1 which holds in many applications involving high-dimensional data, NLDA 
is equivalent to OLDA. Based on the theoretical analysis, an improved algorithm for null space 
LDA algorithm, called iNLDA, is proposed. We have performed extensive experimental studies on 
14 data sets, including both low-dimensional and high-dimensional data. Results have shown that 
condition C1 holds for eight out of the nine high-dimensional data sets, while the null space of S,, 
is empty for all five low-dimensional data. Thus, NLDA may not be applicable for low-dimensional 
data, while OLDA is still applicable in this case. Results are also consistent with our theoretical 
analysis. That is, for all cases when condition C1 holds, NLDA, iNLDA, and OLDA achieve the 
same classification performance. We also observe that for other cases with condition C1 violated, 
OLDA outperforms NLDA and iNLDA, due to the extra number of dimensions used in OLDA. We 
also compare NLDA, iNLDA, and OLDA with uncorrelated LDA (ULDA), which does not perform 
the final orthogonalization step. Results show that OLDA is very competitive with ULDA, which 
confirms the effectiveness of the orthogonalization step used in OLDA. Our empirical and theoret- 
ical results presented in this paper provide further insights into the nature of these two LDA based 
algorithms. 

We also present the ROLDA algorithm, which extends the OLDA algorithm by applying the 
regularization technique. Regularization may stabilize the sample covariance matrix estimation and 
improve the classification performance. ROLDA involves the regularization parameter A, which 
is commonly estimated via cross-validation. To speed up the cross-validation process, we decom- 
pose the computations in ROLDA into two components: the first component involves matrices of 
high dimensionality but independent of À, while the second component involves matrices of low 
dimensionality. When searching for the optimal À from a candidate set, we repeat the computations 
involved in the second component only. A comparative study on classification shows that ROLDA 
is very competitive with OLDA, which shows the effectiveness of the regularization applied in 
ROLDA. 

Our extensive experimental studies have shown that condition C1 holds for most high-dimensional 
data sets. We plan to carry out theoretical analysis on this property in the future. Some of the theo- 
retical results in (Hall et al., 2005) may be useful for our analysis. 

The algorithms in (Yang et al., 2005; Yu and Yang, 2001) are closely related to the null space 
LDA algorithm discussed in this paper. The analysis presented in this paper may be useful in un- 
derstanding why these algorithms perform well in many applications, especially in face recognition. 
We plan to explore this further in the future. 
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